Extraction of Training Sets for Experimentation with Cross Language Information Retrieval Systems
نویسندگان
چکیده
In this paper we focus on methods, models and tools for the extraction of bilingual training / test sets useful for the (semi) automatic classification of textual documents. Such documents could be tutorials, technical specifications, articles, personal notes, etc. Another motivation for our research is the need for managing corpus of classified texts and especially parallel corpora (texts). We discuss the usage of pre-selected key-phrases as attributes for classification, and methods for classifying new documents. These methods could be applied to training data and produce (infer) the corresponding models. We also describe and discuss the classification of various document (textual) types, which is supported by our prototype tool.
منابع مشابه
A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model
Information extraction (IE) is a process of automatically providing a structured representation from an unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) which has been intensified by the increased volume of information and heterogeneity, and non-structured form of it. One of the core information extraction tasks is relation extraction wh...
متن کاملParallel Sentences Mining From The Web
Parallel sentences can benefit many NLP applications (e.g., machine translation, cross language information retrieval.) In this paper, the candidate bilingual webs pages are returned by submit sentence pairs to search engine and then validated by surface patterns. We propose an algorithm to candidate bilingual resource extraction and filter useless bilingual web pages. The pair sentences includ...
متن کاملA Corpus for Cross-Document Co-reference
This paper describes a newly created text corpus of news articles that has been annotated for cross-document co-reference. Being able to robustly resolve references to entities across document boundaries will provide a useful capability for a variety of tasks, ranging from practical information retrieval applications to challenging research in information extraction and natural language underst...
متن کاملTExtractor: a multilingual terminology extraction tool
This demonstration presents a tool (TExtractor) employed for enriching terminology sets in four languages: English, French, German and Spanish. We present the associated linguistic resources and the experimental results obtained in the medical domain. TExtractor has been developed within project LIQUID (IST-2000-25324), which aims at developing a cost-effective solution for the problem of cross...
متن کاملPublic Transport Ontology for Passenger Information Retrieval
Passenger information aims at improving the user-friendliness of public transport systems while influencing passenger route choices to satisfy transit user’s travel requirements. The integration of transit information from multiple agencies is a major challenge in implementation of multi-modal passenger information systems. The problem of information sharing is further compounded by the multi-l...
متن کامل